Abstract. We study the structure of social networks of students by examining 
the graphs of Facebook "friendships" at five American universities at a single point in 
time. We investigate each single-institution network's community structure and em- 
ploy graphical and quantitative tools, including standardized pair-counting methods, to 
measure the correlations between the network communities and a set of self-identified 
user characteristics (residence, class year, major, and high school). We review the 
basic properties and statistics of the pair-counting indices employed and recall, in 
simplified notation, a useful analytical formula for the z-score of the Rand coefficient. 
Our study illustrates how to examine different instances of social networks constructed 
in similar environments, emphasizes the array of social forces that combine to form 
"communities," and leads to comparative observations about online social lives that 
can be used to infer comparisons about offline social structures. In our illustration 
of this methodology, we calculate the relative contributions of different characteristics 
to the community structure of individual universities and subsequently compare these 
relative contributions at different universities, measuring for example the importance 
of common high school affiliation to large state universities and the varying degrees of 
influence common major can have on the social structure at different universities. The 
heterogeneity of communities that we observe indicates that these networks typically 
have multiple organizing factors rather than a single dominant one. 

1. Introduction. Social networks are a ubiquitous part of everyday life. Al- 
though they have long been studied by social scientists [36], the mainstream aware- 
ness of their ubiquity has arisen only recently, in part because of the rise of social 
networking sites (SNSs) on the World Wide Web. Since their introduction, SNSs such 
as Friendster, MySpace, Facebook, Orkut, LinkedIn, and hundreds of others have 
attracted hundreds of millions of users, many of whom have integrated SNSs into their 
daily lives to communicate with friends, send e-mails, solicit opinions or votes, 
organize events, spread ideas, find jobs, and more [2]. Facebook, an SNS launched in 
February 2004, now overwhelms numerous aspects of everyday life, having become an 
especially popular obsession among college and high school students (and, increasingly, 
among other members of society) [1][2][23][25]. Facebook members can create 
self-descriptive profiles that include links to the profiles of their "friends," who may 
or may not be offline friends. Facebook requires that anybody whom one wants to add 
as a friend confirm the relationship, so Facebook friendships define a network (graph) 
of reciprocated ties (undirected edges) that connect individual users. 



The global organization of real-world networks typically includes coexisting modular 
(horizontal) and hierarchical (vertical) organizational structures [51151 [251 1501 [55] . 
Myriad papers have attempted to interpret such organization through the computation 
of structural modules or communities [8][33], which are defined in terms of 
mesoscopic groups of nodes with more internal connections (between nodes in the 
group) than external connections (between nodes in the group and nodes in other 
groups). Such communities, which are not typically identified in advance, are often 
considered to be not merely structural modules but also functionally important 
because of the large number of common ties among nodes in a 
community. Additionally, prior empirical studies have observed some correspondence 
between communities and "ground truth" groups in social and biological networks [33] . 
For example, communities in social networks might correspond to circles of friends 
or business associates, communities in the World Wide Web might encompass pages 
on closely-related topics, communities in metabolic networks have been used to find 
functional modules [15], and communities have been used to identify and measure 
political polarization in legislative processes in the U.S. Congress [37][38]. 

As discussed at length in two recent review articles [8][33] and references therein, 
the classes of techniques available to detect communities are both numerous and di- 
verse; they include hierarchical clustering methods such as single linkage clustering, 
centrality-based methods, local methods, optimization of quality functions such as 
modularity and similar quantities, spectral partitioning, likelihood-based methods, 
and more. In addition to remarkable successes on benchmark examples, investigations 
of community structure have led to success stories in diverse application areas, 
including the reconstruction of college football conferences [H] and the investigation of 
such structures in algorithmic rankings [B]; the analysis of committee assignments [32], 
legislation cosponsorship [38], and voting blocs [37] in the U.S. Congress; the 
examination of functional groups in metabolic networks [15]; the study of ethnic preferences 
in school friendship networks [13]; and the study of social structures in mobile-phone 
conversation networks [31]. 

In this paper, we investigate the community structures of complete Facebook 
networks whose links represent reciprocated "friendships" between user pages (nodes) 
within each of five American universities during a single-time snapshot in September 
2005. Our primary aim in this paper is to use an unsupervised algorithm to compute 
the community structure — consisting of clusters of nodes — of these universities and 
to determine how well the demographic labels included in the data correspond to 
algorithmically computed clusters. We consider only ties between students at the 
same institution, yielding five separate realizations of university social networks and 
allowing us to compare the structures at different institutions. 

The rest of this paper is organized as follows. In Section 2, we describe our 
principal methods: the employed community-detection method, visual exploration 
of identified communities, and standardized pair-counting methods for quantitative 
comparison of communities with demographic data. We present more details about 
the data in Section 3. We then describe and discuss the results that we obtained for 
the five institutions in Section 4 before concluding in Section 5. 

2. Comparing Communities. A social network with a single type of connection 
between nodes can be represented as an adjacency matrix $A$ whose elements $A_{ij}$ 
give the weight of the tie between nodes $i$ and $j$. The Facebook networks we study 
are unweighted, so $A_{ij} \in \{0, 1\}$, where the value is 1 if a tie exists and 0 if it does not. 
The resulting tangle of nodes and links, which we show for the California Institute of 
Technology (Caltech) Facebook network in Fig. 2.1, can obfuscate any organizational 
structure that might be present. 




Fig. 2.1. [Color] (Left) A Fruchterman-Reingold visualization [10] of the largest connected 
component of the Caltech Facebook network. Node shapes and colors indicate House affiliation 
(gray dots denote users who did not identify an affiliation), and the edges are randomly shaded for 
easy viewing. (Right) Magnification of a portion of the network. Clusters of nodes with the same 
color/shape suggest that House affiliation affects the existence of friendships/edges. 

One approach to analyzing such data is to employ exponential random graph 
models (see, e.g., [35]), statistically fitting an underlying model for the presence of 
links. While such models (which can incorporate local network features) are potentially 
valuable for understanding the microscopic processes that underlie the links 
between individual nodes, we take a different approach, focusing on groups of friends 
that form structural "communities": groups of nodes that contain more internal 
connections (links between nodes in the group) than external connections (between nodes 
of the group and nodes in other groups) [8][33]. Our approach is motivated in part by 
the features of the Caltech data (discussed in detail in Sections 3 and 4). Although 
precise results obviously vary from one model specification to another, performing a 
logistic regression on the dyads (pairs of nodes) yields comparable coefficients for link 
presence between users from the same House as from the same high school. However, 
there are significantly more users sharing the former than the latter at Caltech. While 
common high school is unsurprisingly important at the dyadic level (in the rare cases 
that it happens), common House affiliation is apparently much more important for 
understanding structures that consist of larger groups of individuals. Accordingly, our 
goal in this section is to discuss how to compare the composition of algorithmically- 
determined communities to groups defined based on common user characteristics. 

We identify communities using spectral optimization [53] (followed by supplementary 
Kernighan-Lin node-swapping steps [21]) of the "modularity" quality function 
$Q = \sum_i (e_{ii} - b_i^2)$, where $e_{ij}$ denotes the fraction of ends of edges in group $i$ for which 
the other end of the edge lies in group $j$ and $b_i = \sum_j e_{ij}$ is the fraction of all ends 
of edges that lie in group i. High values of modularity correspond to community 
assignments with greater numbers of intra-community links than expected at random 
(with respect to a particular null model [Sl[5ni[33] ) . Numerous other community de- 
tection methods are also available. However, our focus in the present paper is on 
studying communities after they are obtained, and our methods can be applied to 
the output of any community-detection algorithm in which each node is assigned to 
precisely one community. Such an assignment of nodes to communities constitutes 
a partition of the original graph. We seek a means to compare an algorithmically- 
obtained partition to partitions based on information that we have about Facebook 
user characteristics — class year, dormitory (House), high school, and major — as a 
means of exploring the roles of such characteristics in the social structures of each 
institution. An online social network is an imperfect proxy for an offline network, 
but our comparisons are nevertheless expected to yield interesting insights about the 
social life at the universities we study. 
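
As a minimal sketch of the quality function described above (not the spectral optimization itself, which is not reproduced here), the following evaluates $Q = \sum_i (e_{ii} - b_i^2)$ for a given assignment of nodes to communities. The function name `modularity`, the edge-list input format, and the toy graph are illustrative assumptions rather than anything taken from the data.

```python
from collections import defaultdict


def modularity(edges, community):
    """Evaluate Q = sum_i (e_ii - b_i^2) for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once
           (assumed: no self-loops, no duplicate edges).
    community: dict mapping each node to its community label.
    """
    edges = list(edges)
    total_ends = 2 * len(edges)      # every edge contributes two ends
    within = defaultdict(int)        # ends of edges whose other end is in the same group
    ends_in = defaultdict(int)       # all ends of edges that lie in the group
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends_in[cu] += 1
        ends_in[cv] += 1
        if cu == cv:
            within[cu] += 2
    return sum(within[c] / total_ends - (ends_in[c] / total_ends) ** 2
               for c in ends_in)


# Toy example: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
print(modularity(toy_edges, two_groups))  # 5/14, about 0.357
print(modularity(toy_edges, one_group))   # 0.0
```

Placing every node in a single group gives $Q = 0$, which illustrates that modularity rewards assignments only insofar as they contain more intra-community links than the null model expects.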

2.1. Visual Comparisons. The demographic composition of communities is 
sometimes clear from visual inspection. This is the case with the community structure 
of the Caltech network, which agrees closely with its undergraduate "House" system. 
In Fig. 2.2, we show a force-directed layout of Caltech's 12 communities (yielding 
a modularity of $Q = 0.4002$), which we show as pies with area proportional to the 
number of constituent nodes. Purple slices signify individuals who did not identify a 
House affiliation. 



Fig. 2.2. [Color] (Left) Force-directed layout of Caltech communities, each represented by a pie 
chart with area proportional to population and colored by House affiliation (with purple signifying 
missing information). (Right) Distribution of Rand coefficients comparing these 12 Caltech communities 
with random permutations of partitions into 9 House categories (including "Missing"). For 
comparison, we plot in red a Gaussian with the sample mean and variance. As our smallest data 
set, this yields the most extreme deviation from the Gaussian in our permutation tests. 

In contrast to the other universities (see Section 4), we find that House affiliation is the 
primary organizing principle of the communities in the Caltech network, which is 
what we expected because Caltech's House structure is so dominant socially. Indeed, 
each pie in Fig. 2.2 is dominated by members of one House. Moreover, many pies 
include a significant number of people who identify "Avery House" as their affiliation 
(dark blue), which is expected because of its different residency rules (members of 
all Houses could live in Avery at the time of this data). Given the promotion of 
Avery House to official House status after our data snapshot, it is natural to wonder 
if community detection on current data would find a community dominated by Avery. 
Investigating the formation of such a community using longitudinal data would be 
even more interesting, but is beyond the scope of our data. In principle, one can also 
make limited predictions based on the compositions of the communities about users 
who did not volunteer their House affiliation. 

Despite this demonstration of the utility of visualizing communities, it is typically 
necessary to perform quantitative analyses after detecting communities, as Caltech is 
unusual among universities in having a single characteristic that aligns so closely with 
its communities. For other institutions, we observe more heterogeneous communities, 
and it is typically difficult to visually assess which characteristics best correlate with 
the communities or even whether there is any strong correlation at all. To investigate 
the social organization of communities at such universities, it is thus essential to quan- 
titatively compare the detected communities with the available demographic groups. 
Such considerations apply broadly to community detection in most networks [33]. 

2.2. Pair Counting. As discussed in Refs. [lOlllSlj methods to compare graph 
partitions can be classified roughly into three groups: (1) pair counting, (2) cluster 
matching, and (3) information-theoretic techniques. Cluster matching might be par- 
ticularly problematic in the present context, as the numbers and sizes of groups vary 
significantly across the comparisons, which makes the essential identifications across 
partitions rather difficult. We focus on a collection of pair-counting methods, in part 
because of their convenient algebraic description, as one just needs to count the ways 
that pairs of nodes are grouped across two partitions. That same simplicity can also 
be a weakness, as it can present a serious interpretation difficulty because of the unclear 
range of "good" scores. However, as we will show in Section 2.3, standardization 
of pair-counting scores provides a unified interpretation of a number of seemingly disparate 
pair-counting measures and is particularly useful for the present setting. We 
also compare these results with those obtained using variation of information (VI) [26]. 

A pair-counting method defines a similarity score by counting each pair of nodes 
drawn from the n nodes of a network according to whether the pair falls in the same 
or in different groups in each partition. Pair-counting methods comprise a subset of 
a more general class of association measures that can be used for studying unordered 
(i.e., categorical) contingency tables [IHIIJUES] . We denote the counts of node pairs 
in each classification as $w_{11}$ (pairs classified together in both partitions), $w_{10}$ (same in 
the first but different in the second), $w_{01}$ (different in the first but same in the second), 
and $w_{00}$ (different in both). The sum of these quantities is, by definition, equal to the 
total number $M$ of node pairs: $M = w_{11} + w_{10} + w_{01} + w_{00} = \binom{n}{2} = n(n-1)/2$. Given 
two partitions of a network, one can obtain many different pair-counting similarity 
coefficients using different algebraic combinations of the counts. 
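
The four pair counts can be tallied directly from their definitions. The sketch below is a literal, quadratic-time illustration that assumes each partition is supplied as a list of group labels indexed by node; the function name `pair_counts` and the toy labels are hypothetical.

```python
from itertools import combinations


def pair_counts(part_a, part_b):
    """Return (w11, w10, w01, w00) for two partitions of the same n nodes.

    part_a, part_b: sequences of group labels, one entry per node.
    Runs over all n(n-1)/2 node pairs, so it is meant for small examples.
    """
    if len(part_a) != len(part_b):
        raise ValueError("both partitions must label the same nodes")
    w11 = w10 = w01 = w00 = 0
    for i, j in combinations(range(len(part_a)), 2):
        same_a = part_a[i] == part_a[j]
        same_b = part_b[i] == part_b[j]
        if same_a and same_b:
            w11 += 1          # together in both partitions
        elif same_a:
            w10 += 1          # together in the first, apart in the second
        elif same_b:
            w01 += 1          # apart in the first, together in the second
        else:
            w00 += 1          # apart in both
    return w11, w10, w01, w00


communities = [0, 0, 0, 1, 1, 1]
houses = ["a", "a", "b", "b", "c", "c"]
print(pair_counts(communities, houses))  # (2, 4, 1, 8); the counts sum to 6*5/2 = 15
```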

We first consider the Rand similarity coefficient $S_R = (w_{11} + w_{00})/M$ [34], which 
counts the fraction of node pairs identified the same way by both partitions (either 
together in both or separate in both). Bounded between 0 (no similar pair placements) 
and 1 (identical partitions), the Rand coefficient is extremely intuitive and can be 
used fruitfully in many settings. However, it has an important deficiency: The Rand 
coefficient for two network partitions that each contain large numbers of categories is 
skewed towards the value 1 because of the large fraction of node pairs that are placed 
in different groups even when comparing two partitions with little in common. 

If one wishes to exclude $w_{00}$ from having an explicit role, one can use the Jaccard 
index $S_J = w_{11}/(w_{11} + w_{10} + w_{01})$ or the Fowlkes-Mallows similarity coefficient 
$S_{FM} = w_{11}/\sqrt{(w_{11} + w_{10})(w_{11} + w_{01})}$. Both $S_J$ and $S_{FM}$ clearly avoid the problematic 
effects of large $w_{00}$, but because they ignore node pairs that both partitions place in 
different groups, they yield overly high values when comparing network partitions 
with very few categories (or when one partition consists of a single group). Another 
index is the Minkowski coefficient $S_M = \sqrt{(w_{10} + w_{01})/(w_{10} + w_{11})}$, which is asymmetric 
in its consideration of the two partitions. The first serves as a distinguished 
reference, measuring the number of mismatches relative to the number of similarly 
grouped pairs in that reference. Hence, $S_M$ values closer to 0 are considered better. 
The $\Gamma$ similarity coefficient, defined as 

\[ S_\Gamma = \frac{M w_{11} - (w_{11} + w_{10})(w_{11} + w_{01})}{\sqrt{(w_{11} + w_{10})(w_{11} + w_{01})[M - (w_{11} + w_{10})][M - (w_{11} + w_{01})]}}, \]

has the most complicated algebraic form of the similarity coefficients that we employ. 
Additional measures and discussions are available in Refs. [7][19][26]. Notably, each $S_i$ 
measure suffers from the difficulty that it is unclear what constitutes a "good" value, 
as they all depend intimately on the numbers and sizes of the groups in the partition. 
(We illustrate this in Section 4 with computations for the Caltech network and discuss 
further properties of the similarity indices in Subsection 2.3.) 

One can also try to alleviate the problem of identifying good similarity values 
by introducing various "adjusted" indices that report comparisons as a similarity 
relative to that which might be obtained at random. For instance, one can construct 
adjusted indices by subtracting the expected value (under some null model, typically 
conditional on maintaining the numbers and sizes of groups in the two partitions) and 
then rescaling the result by the difference between the maximum allowed value and 
the mean value [18]. One such index, using a bound on the maximum allowed value, 
is the Adjusted Rand coefficient [18] 

\[ S_{AR} = \frac{w_{11} - \frac{1}{M}(w_{11} + w_{10})(w_{11} + w_{01})}{\frac{1}{2}[(w_{11} + w_{10}) + (w_{11} + w_{01})] - \frac{1}{M}(w_{11} + w_{10})(w_{11} + w_{01})}. \]

As described in Ref. [26], adjusted indices can be problematic because the focus 
on the maximum possible values does not guarantee accurate comparisons between 
similarity coefficients across different settings. In particular, this implies that one can- 
not necessarily use similarity scores to make direct comparisons between communities 
and House with those between communities and high school (which is something that 
we specifically aim to do). That is, even if such comparisons yield Adjusted Rand 
values of 0.1 and 0.2, it is not at all clear that the second situation should be construed 
to yield a closer pair of partitions than the first. Consequently, the general problem 
of knowing what similarity-score values indicate a good correlation remains. 
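
To make the definitions in this subsection concrete, the sketch below evaluates the six similarity coefficients from the four pair counts. The function name `similarity_indices` and the dictionary keys are illustrative choices; the sketch also assumes non-degenerate partitions, since several denominators vanish when, for example, a partition places every node in its own group.

```python
from math import sqrt


def similarity_indices(w11, w10, w01, w00):
    """Pair-counting similarity coefficients from the four pair counts.

    Assumes non-degenerate partitions so that no denominator is zero.
    """
    M = w11 + w10 + w01 + w00      # total number of node pairs
    m1 = w11 + w10                 # pairs grouped together in the first partition
    m2 = w11 + w01                 # pairs grouped together in the second partition
    expected = m1 * m2 / M         # expected w11 under the random null model
    return {
        "rand": (w11 + w00) / M,
        "jaccard": w11 / (w11 + w10 + w01),
        "fowlkes_mallows": w11 / sqrt(m1 * m2),
        "minkowski": sqrt((w10 + w01) / m1),   # first partition is the reference
        "gamma": (M * w11 - m1 * m2) / sqrt(m1 * m2 * (M - m1) * (M - m2)),
        "adjusted_rand": (w11 - expected) / (0.5 * (m1 + m2) - expected),
    }


# Hypothetical counts for two partitions of 6 nodes (15 pairs in total).
scores = similarity_indices(2, 4, 1, 8)
for name, value in scores.items():
    print(f"{name:16s} {value:.4f}")
```

For these toy counts the Rand coefficient is about 0.67 while the Adjusted Rand coefficient is about 0.24, a small illustration of why raw values of different indices are hard to compare with one another.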

2.3. Standardized Pair Counting. Numerous studies have attempted to as- 
sess the utility of similarity measures. However, because partitioning according to 
demographic traits yields a graph partitioning that typically differs significantly from 
that obtained using algorithmic community detection, we use a classical statistical 
approach, advocated in [3][9], wherein similarity measures are used in the context of 
testing significance levels of the obtained values versus those expected at random. We 
recommend using a proper metric (i.e., a quantity that is a metric in the mathemati- 
cal sense rather than only in an informal sense) such as variation of information [26] 
for comparing partitions that are close to one another. However, in the Facebook 
networks, the mutual information of a pair of partitions is small compared to the 
total information in each. In such cases, two partitions can be relatively far from each 
other according to a distance measure but might nevertheless be very far in the tail 
of the distribution of what can be expected at random. It is consequently more ap- 
propriate to identify the pair-counting strength relative to that obtained at random, 
standardized by the width of the distribution via z-scores $z_i = (S_i - \mu_i)/\sigma_i$, which 
indicate the number of standard deviations $\sigma_i$ by which the $S_i$-value is more correlated 
than the mean $\mu_i$ ($i \in \{FM, \Gamma, J, M, R, AR\}$, noting the need to multiply by $-1$ for 
the Minkowski coefficient, for which values closer to 0 are better). 

One can obtain z-scores nonparametrically using permutation tests [14], though 
we will identify analytical formulas for $z_R$ and show that the Fowlkes-Mallows, $\Gamma$, 
Rand, and Adjusted Rand z-scores are identical. The elements $n_{ij}$ of the contingency 
table indicate the number of nodes that are classified into the $i$th group of the first 
partition and the $j$th group of the second partition. As long as partitions are constrained 
to have the same numbers and sizes of groups as the original partitions (i.e., as long 
as the row and column sums, $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$, remain constant), then 
the total number of pairs $M$, the number of pairs $M_1 = \sum_i \binom{n_{i\cdot}}{2}$ classified the same 
way in the first partition, and the analogous quantity $M_2 = \sum_j \binom{n_{\cdot j}}{2}$ for the second 
partition likewise remain constant. This implies that any pair-counting index specified 
by the $w_{\alpha\beta}$ counts can be equivalently specified in terms of only $w := w_{11} = \sum_{i,j} \binom{n_{ij}}{2}$ 
because $w_{10} = M_1 - w$, $w_{01} = M_2 - w$, and $w_{00} = M - M_1 - M_2 + w$. It follows 
immediately that $S_\Gamma$, $S_{FM}$, $S_R$, and $S_{AR}$ are each linear functions of $w$ and hence linear 
functions of each other [19]. Any similarity index $S_i$ that is a linear function of $w$ must 
be statistically equivalent to $w$ in any null model (given constant $M$, $M_1$, and $M_2$), 
with the z-score and p-value equal to those associated with the specified $w$. Meanwhile, 
as we demonstrate in Section 4, the $S_i$ values can have different orderings in different 
comparisons because of their dependence on $M$, $M_1$, and $M_2$. 

It is also instructive to note the relationships between the linear-in-$w$ similarity 
coefficients and the Jaccard and Minkowski indices: $1/S_J = -1 + (M_1 + M_2)/w$ 
and $S_M^2 = (M_1 + M_2 - 2w)/M_1$. The asymmetry in the Minkowski index is clearly 
limited, as switching which partition is the reference changes the coefficient by a 
multiplicative factor of $\sqrt{M_1/M_2}$. Because the square root and multiplicative inverse are both 
monotonic operations in the domains of these indices ($S_M \geq 0$ and $0 \leq S_J \leq 1$), it 
follows that the p-values of the cumulative distributions of each are identical to the 
p-value of $w$ itself even though the corresponding z-scores can be different. 
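
As a worked restatement of these relationships (not an additional result), substituting $w_{11} = w$, $w_{10} = M_1 - w$, $w_{01} = M_2 - w$, and $w_{00} = M - M_1 - M_2 + w$ into the definitions of Section 2.2 gives the explicit forms below. The first four are linear in $w$ for fixed $M$, $M_1$, and $M_2$; the last two are monotonic but nonlinear functions of $w$.

```latex
\begin{align*}
S_R      &= \frac{w_{11} + w_{00}}{M} = \frac{M - M_1 - M_2 + 2w}{M}, \\
S_{FM}   &= \frac{w}{\sqrt{M_1 M_2}}, \\
S_\Gamma &= \frac{M w - M_1 M_2}{\sqrt{M_1 M_2 (M - M_1)(M - M_2)}}, \\
S_{AR}   &= \frac{w - M_1 M_2 / M}{\tfrac{1}{2}(M_1 + M_2) - M_1 M_2 / M}, \\
S_J      &= \frac{w}{M_1 + M_2 - w}, \qquad
S_M       = \sqrt{\frac{M_1 + M_2 - 2w}{M_1}} .
\end{align*}
```

Taking the second partition as the Minkowski reference replaces $M_1$ by $M_2$ in the denominator of $S_M$, so the two versions of the coefficient differ only by a factor that does not depend on $w$.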

In deference to the seminal presentation of the Rand index [34], we refer to the 
z-score of the linear-in-$w$ scores as z-Rand: $z_R = (w - \mu_w)/\sigma_w$, where $\mu_w$ and $\sigma_w$ are, 
respectively, the mean and standard deviation of $w$ (noting its equivalence by linearity 
to the z-score advocated explicitly by Brennan and Light [5]). In the absence of 
external information that indicates a need to impose specific correlations, we adopt 
the standard and analytically tractable assumption of a random hypergeometric distribution 
of equally likely assignments subject to fixed row and column sums. The 
expected value then becomes $\mu_w = M_1 M_2/M$, as for the adjusted Rand index [18]. 
The calculation of higher-order moments is more involved I3121[T71[23j . In order to 
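make $z_R$ as simple as possible to calculate, we rewrite the formulas of [17] as follows: 

\[ z_R = \frac{1}{\sigma_w}\left(w - \frac{M_1 M_2}{M}\right), \tag{2.1} \]

\[ \sigma_w^2 = \frac{M}{16} - \frac{(4M_1 - 2M)^2 (4M_2 - 2M)^2}{256 M^2} + \frac{C_1 C_2}{16 n(n-1)(n-2)} + \frac{[(4M_1 - 2M)^2 - 4C_1 - 4M][(4M_2 - 2M)^2 - 4C_2 - 4M]}{64 n(n-1)(n-2)(n-3)}, \tag{2.2} \]

where 

\[ C_1 = n(n^2 - 3n - 2) - 8(n+1)M_1 + 4\sum_i n_{i\cdot}^3, \qquad C_2 = n(n^2 - 3n - 2) - 8(n+1)M_2 + 4\sum_j n_{\cdot j}^3. \tag{2.3} \]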
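
A minimal sketch of how (2.1)-(2.3) might be evaluated is given below. It assumes that each partition is supplied as a list of group labels indexed by node and that $n \geq 4$ with a nonzero variance; the function name `z_rand` and its return convention are illustrative choices rather than part of the original method description.

```python
from collections import Counter
from math import comb, sqrt


def z_rand(part_a, part_b):
    """Return (z_R, mean of w, variance of w) from Eqs. (2.1)-(2.3).

    part_a, part_b: sequences of group labels, one entry per node.
    Assumes n >= 4 and a nonzero variance under the null model.
    """
    n = len(part_a)
    if n != len(part_b) or n < 4:
        raise ValueError("need two labelings of the same n >= 4 nodes")
    rows = Counter(part_a)                 # row sums n_i. of the contingency table
    cols = Counter(part_b)                 # column sums n_.j
    cells = Counter(zip(part_a, part_b))   # contingency table entries n_ij
    M = comb(n, 2)
    M1 = sum(comb(c, 2) for c in rows.values())
    M2 = sum(comb(c, 2) for c in cols.values())
    w = sum(comb(c, 2) for c in cells.values())
    C1 = n * (n * n - 3 * n - 2) - 8 * (n + 1) * M1 + 4 * sum(c ** 3 for c in rows.values())
    C2 = n * (n * n - 3 * n - 2) - 8 * (n + 1) * M2 + 4 * sum(c ** 3 for c in cols.values())
    a = (4 * M1 - 2 * M) ** 2
    b = (4 * M2 - 2 * M) ** 2
    var = (M / 16
           - a * b / (256 * M * M)
           + C1 * C2 / (16 * n * (n - 1) * (n - 2))
           + (a - 4 * C1 - 4 * M) * (b - 4 * C2 - 4 * M)
           / (64 * n * (n - 1) * (n - 2) * (n - 3)))
    mean = M1 * M2 / M
    return (w - mean) / sqrt(var), mean, var


# Toy check: 5 nodes, both partitions have group sizes 2 and 3.
z, mean, var = z_rand([0, 0, 1, 1, 1], [0, 0, 1, 1, 1])
print(mean, var, z)   # mean 1.6, variance 0.84 (up to rounding), z about 2.62
```

For this toy case one can enumerate the ten equally likely placements of the two-node group by hand: $w$ takes the values 4, 1, and 2 with probabilities 1/10, 6/10, and 3/10, which gives a mean of 1.6 and a variance of 0.84, in agreement with the formulas.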



While we advocate the use of $z_R$ scores, their associated significance levels (equivalently, 
the p-values of the cumulative distribution) are not equal to those for a Gaussian 
distribution. The distribution for large samples is asymptotically Gaussian [22], but 
the distribution associated with comparing a particular pair of partitions need not 
be. Indeed, the tails of the distribution can be quite heavy [i], so the probability of 
obtaining extreme z-scores can be orders-of-magnitude higher than in the normal dis- 
tribution. Nevertheless, the Gaussian approximation is frequently sufficient to gauge 
statistical significance (past the 95% confidence interval). Given the straightforward 
calculation of (2.1)-(2.3), we prefer to use $z_R$ directly, with the caveat that the Rand 
indices do not translate directly to p-values. 

Where simple formulas for the necessary moments do not appear to be avail- 
able (i.e., for the Jaccard and Minkowski indices), we resort to the computationally 
straightforward (albeit intensive if one desires high accuracy) method of examining 
distributions obtained using permutation tests [14], again under the null model of 
equally-likely node assignments conditional on the constancy of the numbers and 
sizes of groups. Specifically, starting from two network partitions whose correlation 
we want to measure, we calculate the similarity values Si and obtain a context for 
these values by repeatedly computing Si under random permutation of the node as- 
signments in one of the partitions. (Subsequent permutation of assignments in the 
second partition is redundant.) We thereby aim to compare the similarity coefficients 
between the two partitions to the distributions of such coefficients from the appropriate 
ensemble of partition pairs. Numerical estimation of p-values far in the tail of the 
distribution (where many of our points of interest lie) necessarily requires sampling 
a correspondingly large number of elements. In contrast, calculating z-scores only 
requires sampling the first two moments of the distribution. We typically use 10000 
permutations (even for the larger networks, where the number of nodes is actually 
larger than the number of permutations considered), confirming that the obtained 
z-scores have converged to roughly two significant figures by comparing them with 
those obtained using half of the permutations and also comparing $z_R$ estimates with 
the analytical values obtained from (2.1)-(2.3). 
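
The permutation procedure just described can be sketched as follows for the Jaccard index. The function names, the default of 10000 permutations, and the fixed seed are illustrative assumptions; the sketch also assumes that the shuffled index values are not all identical, so that the sample standard deviation is nonzero.

```python
import random
from collections import Counter
from math import comb, sqrt


def jaccard(part_a, part_b):
    """Jaccard index S_J = w / (M1 + M2 - w), computed from contingency counts."""
    m1 = sum(comb(c, 2) for c in Counter(part_a).values())
    m2 = sum(comb(c, 2) for c in Counter(part_b).values())
    w = sum(comb(c, 2) for c in Counter(zip(part_a, part_b)).values())
    return w / (m1 + m2 - w)


def permutation_z_score(part_a, part_b, index=jaccard, n_perm=10000, seed=0):
    """Estimate the z-score of `index` by shuffling node labels in one partition.

    Shuffling preserves the numbers and sizes of groups in both partitions,
    which is the null model used in the text.
    """
    rng = random.Random(seed)
    observed = index(part_a, part_b)
    shuffled = list(part_b)
    samples = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        samples.append(index(part_a, shuffled))
    mean = sum(samples) / n_perm
    var = sum((s - mean) ** 2 for s in samples) / (n_perm - 1)
    return (observed - mean) / sqrt(var)


# Toy usage: 40 nodes in 4 equal groups, compared with an identical partition.
labels = [node % 4 for node in range(40)]
print(permutation_z_score(labels, labels, n_perm=500, seed=1))
```

For an index such as the Minkowski coefficient, for which smaller values indicate greater similarity, one would multiply the resulting z-score by $-1$.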

Of course, calculating z-scores of the pair-counting indices is not a panacea, par- 
ticularly when comparing networks of different sizes. Nevertheless, we find them to 
be exceptionally useful for examining the correlations between communities and partitions 
by the available demographics in our Facebook data. Before we concentrate on 
using these z-scores to measure correlations, we compare test results (similar to those 
discussed in Section 4) against other methods, including variation of information [26] 
and the (non-standardized) Adjusted Rand index $S_{AR}$ [18], using a scatter plot versus 
$z_R$ in Fig. 2.3. While $S_{AR}$ trends positively with $z_R$ (recall that $z_R = z_{AR}$), there are 
clearly situations with very small $S_{AR}$ that have much larger $z_R$ values than should 
be expected at random. We additionally observe that $z_J$ and $z_M$ each appear to be 
closely approximated by $z_R$ at the scale of Fig. 2.3, though closer inspection reveals 
relative differences occasionally as large as 10%. 

Fig. 2.3. Scatter plot of $z_R$ (the Rand z-score) on the horizontal axis versus (on the vertical 
axis) other pair-counting z-scores ($z_J$ and $z_M$), variation of information (VI), a VI z-score from 
permutation tests, and the Adjusted Rand index $S_{AR}$. The depicted data come from 60 situations: 
algorithmically-detected communities for the 5 universities using 4 demographic groupings and 3 
networks per university (full data and gender-restricted networks of women only and men only). 

We admit that we are arguably guilty of one of the major sins of statistical 
analysis, in that z-scores are typically a proxy for the likelihood with which one 
can reject an independent null hypothesis. It is thus reasonable to question their 
effectiveness for the quite different task of measuring a correlation. We stress, however, 
that the underlying statistic that we have standardized is a pair counting of the 
similarities between partitions rather than a deviation from independence. (We 
note that $w$ reduces to a linear function of $\chi^2$ in the special case of uniform constant 
marginals [4].) Therefore, in the absence of enforcing a particular model for the form 
of the correlation between partitions, we believe this standardization of similarity 
scores is a reasonable way to proceed (if done so with caution). 

3. Data. Our data, which was sent directly to us in anonymized form by Adam 
D'Angelo of Facebook, consists of the complete set of users (nodes) from the Facebook 
networks for each of five American universities and all of the links between those users' 
pages for a single-time snapshot from September 2005. Similar snapshots of Facebook 
data from 10 Texas universities were analyzed recently in Ref. [25], and a snapshot 
from "a diverse private college in the Northeast U.S." was studied in Ref. [23]. Other 
studies of Facebook have typically obtained data either through surveys [2] or through 
various forms of automated sampling [12], which yield data with missing nodes and links 
that can strongly impact the resulting graph structures and analyses. 

We consider only ties between people at the same institution, which yields five 
separate realizations of university social networks and allows us to compare the struc- 
tures at different institutions. Our study includes a small technical institute (Cali- 
fornia Institute of Technology [Caltech]), a pair of private universities (Georgetown 
University and Princeton University), and a pair of large state universities (University 
of Oklahoma and University of North Carolina at Chapel Hill [UNC]). 

We have posted the data at http://people.maths.ox.ac.uk/~porterm/data/facebook5.zip 

Fig. 3.1. Cumulative degree distributions (top panels) and local clustering coefficients (bottom 
panels) for the five university networks (panels: Caltech, Georgetown, Oklahoma, Princeton, UNC). 
We employ semilogarithmic coordinates. The horizontal 
axes give the degree relative to the mean degree $\langle k \rangle$, and we only display data for $k/\langle k \rangle < 8$ to 
provide common axes for all universities. 

We summarize basic properties of the university networks in Fig. 3.1 and Table 
3.1. See [IHHSm and references therein for discussions of the measures that we use 
in this section. Although our focus in this paper is community structure, we remark 
that even these simple network characteristics can yield insights about Facebook networks. 
The mean degrees tend to increase with network size, potentially indicating 
that broader institutional use begets greater personal use (though this trend is clearly 
strongly influenced by the Caltech data). The degree distributions of these institutions 
(plotted in the top panels of Fig. 3.1) have heavy tails compared to random graphs. 
In particular, the degree distributions appear to be approximately exponential. Al- 
though the mechanisms driving such distributions are impossible to ascertain without 
longitudinal data, the roughly exponential form of the degree distribution both above 
and below the mean degree potentially indicates a wide range in the willingness to 
participate (i.e., to add online friends) among Facebook users. 
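
As a quick consistency check on the mean degrees discussed above, the following minimal Python sketch recomputes ⟨k⟩ = 2m/n from the numbers of nodes n and edges m in the largest connected components reported in Table 3.1. It also evaluates the graph density 2m/[n(n-1)] for comparison with the reported mean clustering coefficients. The names `table_3_1`, `mean_degree`, and `density` are illustrative choices and are not part of the original analysis.

```python
# (connected nodes n, connected edges m, reported mean degree,
#  reported mean clustering coefficient), copied from Table 3.1
table_3_1 = {
    "Caltech":    (762,   16651,  43.7,  0.4091),
    "Georgetown": (9388,  425619, 90.7,  0.2249),
    "Oklahoma":   (17420, 892524, 102.5, 0.2297),
    "Princeton":  (6575,  293307, 89.2,  0.2372),
    "UNC":        (18158, 766796, 84.5,  0.2020),
}


def mean_degree(n, m):
    """Mean degree of an undirected graph: every edge has two endpoints."""
    return 2 * m / n


def density(n, m):
    """Fraction of the n(n-1)/2 possible node pairs that are connected."""
    return 2 * m / (n * (n - 1))


for name, (n, m, k_reported, c_reported) in table_3_1.items():
    print(f"{name:<11} <k> = {mean_degree(n, m):6.1f} (reported {k_reported}), "
          f"density = {density(n, m):.4f}, mean clustering = {c_reported}")
```

For each institution, the recomputed mean degree agrees with the tabulated value to one decimal place, and the density is far smaller than the mean clustering coefficient.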

The bottom panels of Fig. 3.1 compare node degree versus clustering coefficient,

$$ C_i = \frac{\text{number of pairs of neighbors of node } i \text{ that are connected}}{\text{number of pairs of neighbors of node } i}\,. $$

We note that even heavy users have much larger local clustering than that expected
at random (e.g., when compared with the total graph densities). In Table 3.1, we
provide the mean clustering coefficient and the transitivity for each network, the
latter given by the fraction of connected triples in the network that are fully connected
triangles. Both measures of local clustering are much larger at Caltech than they are at
the other institutions. It is of course not surprising that we observe large transitivities
in social networks such as the Facebook networks. Nevertheless, as we have shown
recently in Ref. [27], tree-based theories of various dynamical processes appear to be
valid for Facebook networks (despite their high clustering, implying that they are most
definitely not locally tree-like) because they are "sufficiently small" worlds, in that
the mean distance between nodes is close to the expected value obtained in random
networks with the same joint degree-degree distributions.
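
The two clustering measures described above can be illustrated with a small pure-Python sketch. It assumes an undirected, unweighted graph stored as a dictionary of neighbor sets (the names `adj`, `local_clustering`, `mean_clustering`, and `transitivity` are hypothetical), and it adopts the convention that C_i = 0 for nodes with fewer than two neighbors, which the text does not specify.

```python
from itertools import combinations


def local_clustering(adj, i):
    """Fraction of pairs of neighbors of node i that are themselves connected."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs means C_i = 0
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)


def mean_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)


def transitivity(adj):
    """Fraction of connected triples (paths of length two) that are closed."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / triples if triples else 0.0


# Toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(mean_clustering(adj), transitivity(adj))
```

On this toy graph the mean clustering coefficient is 7/12 and the transitivity is 3/5, which shows that the two measures generally differ, as they do in Table 3.1.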

                             Caltech   Georgetown   Oklahoma   Princeton   UNC
Nodes                        1099      12195        24110      8555        24780
Connected Nodes              762       9388         17420      6575        18158
Connected Edges              16651     425619       892524     293307      766796
Mean Degree                  43.7      90.7         102.5      89.2        84.5
Mean Clustering Coeff.       0.4091    0.2249       0.2297     0.2372      0.2020
Transitivity                 0.2913    0.1485       0.1587     0.1639      0.1156
Degree Assortativity         -0.0662   0.0753       0.0737     0.0910      6.6x10"^
Gender Assortativity         0.0540    0.0145       0.1118     0.0650      0.0598
Major Assortativity          0.0382    0.0439       0.0412     0.0474      0.0511
Dormitory Assortativity      0.4486    0.1725       0.4033     0.0872      0.2024
Year Assortativity           0.2694    0.5575       0.2923     0.4947      0.3964
High School Assortativity    0.0021    0.0237       0.1583     0.0197      0.1342
Number of Communities        12        33           5          12          5
Modularity                   0.4003    0.4801       0.3869     0.4527      0.4274

Table 3.1

Basic characteristics of the largest connected components of the five Facebook networks that we
study: the total number of nodes in the original data, numbers of nodes and edges in the largest
connected component, mean degree, mean clustering coefficient, transitivity (fraction of transitive
triples), assortativities (by degree, gender, major, dormitory, class year, and high school), number
of communities detected, and the modularity of the resulting graph partition. In calculating the
assortativities, we ignored nodes for which the corresponding demographic characteristic is missing
(i.e., the "pairwise removal" protocol that we discuss in Section 4). We treat class year as a
categorical variable here, and we calculate degree assortativity as a correlation coefficient [28].

The data also includes limited demographic information provided by users on their
individual pages: gender, class year, and data fields that represent (using anonymous
numerical identifiers) high school, major, and dormitory residence (or "House" at
Caltech). In situations in which individuals elected not to volunteer a demographic
characteristic, we use an additional "Missing" label. These characteristics allow us
to make comparisons between different universities, under the assumption (per the
discussion in Ref. [2]) that the communities and other elements of structural
organization in Facebook networks reflect (even if imperfectly) the social communities and
organization of the offline networks on which they are based.

For instance, at the level of individual ties, the tendency for users to be friends
with other users who have similar characteristics can be quantified by the assortativity
of the links relative to that characteristic. Degree assortativity (or degree correlation)
can be calculated as the Pearson correlation coefficient of the degrees at the two ends
of each edge. Although many social networks tend to be positively assortative with
respect to degree, we find that the degree assortativity is negative for Caltech and is
very small for UNC. A general measure of scalar assortativity relative to a categorical
variable is given by

$$ r = \frac{\operatorname{tr}(e) - \|e^2\|}{1 - \|e^2\|} \in [-1, 1]\,, \tag{3.1} $$

where e = E/||E|| is the normalized mixing matrix, the elements E_ij give the number
of edges in the network that connect a node of type i (e.g., a person with a given
major) to a node of type j, and the entry-wise matrix 1-norm ||E|| is equal to the
sum of all entries of E. Comparing assortativities for various categories shows, for
example, that assortativities by dormitory and class year (treated as a categorical
variable) are high for all five institutions; assortativities by major are low for all
five institutions; and assortativities by high school and gender are less consistent
across institutions. The relative sizes of the different assortativities also vary across
institutions, which is similar to what we will see below with communities. Going
beyond this measure of local assortativity by characteristics, our major focus for this
article is on the organization of the communities of these five Facebook networks based
on these various categories. We discuss this in detail in Section 4.
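
To make the assortativity coefficient of Eq. (3.1) concrete, the following minimal Python sketch builds the mixing matrix E from an edge list and a node-to-category mapping and then evaluates r. It assumes an undirected graph in which each edge is counted once in each direction (so that E is symmetric) and in which every node on an edge has a label; the function name `categorical_assortativity` and the toy labels are hypothetical. The sum of all entries of e² is computed from the row and column sums of e, which is algebraically the same quantity.

```python
def categorical_assortativity(edges, label):
    """Evaluate r = (tr(e) - ||e^2||) / (1 - ||e^2||) for a categorical label."""
    types = sorted({label[u] for edge in edges for u in edge})
    index = {t: n for n, t in enumerate(types)}
    size = len(types)

    # Mixing matrix: E[i][j] counts edge ends joining type i to type j.
    E = [[0] * size for _ in range(size)]
    for u, v in edges:
        i, j = index[label[u]], index[label[v]]
        E[i][j] += 1
        E[j][i] += 1  # undirected edge, counted in both directions

    total = sum(map(sum, E))
    e = [[x / total for x in row] for row in E]

    trace = sum(e[i][i] for i in range(size))
    row_sums = [sum(e[i]) for i in range(size)]
    col_sums = [sum(e[i][j] for i in range(size)) for j in range(size)]
    # Sum of all entries of the matrix product e.e
    e2_norm = sum(col_sums[k] * row_sums[k] for k in range(size))

    # Undefined (division by zero) if all edge ends share a single type.
    return (trace - e2_norm) / (1 - e2_norm)


label = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
print(categorical_assortativity([("a", "b"), ("c", "d")], label))              # same-type ties only
print(categorical_assortativity([("a", "c"), ("b", "d")], label))              # cross-type ties only
print(categorical_assortativity([("a", "b"), ("c", "d"), ("b", "c")], label))  # mixed
```

The first two calls give the extreme values r = 1 and r = -1, and the mixed example gives r = 1/3.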
           S_AR     S_FM     S_Γ      S_J      S_M      S_R      VI       z_J     z_M     z_R
"Major"    0.0063   0.1195   0.0070   0.0576   1.1238   0.7785   4.3149   3.96    3.95    3.96
"House"    0.3762   0.4742   0.3829   0.3056   0.9578   0.8391   1.9275   249     226     198
"Year"     0.0080   0.1766   0.0080   0.0968   1.2637   0.7199   3.5191   6.84    6.82    6.73
"H.S."     0.0085   0.0833   0.0129   0.0301   1.0484   0.8072   4.7268   -0.55   -0.55   -0.55

Table 4.1

Similarity coefficients (Adjusted Rand, Fowlkes-Mallows, Γ, Jaccard, Minkowski, and Rand),
variation of information, and similarity z-scores for comparing a 12-community partition of the
Caltech data versus a partition constructed using each of the four self-identified user characteristics.



4. Facebook Communities. We algorithmically identify a set of communities
in the largest connected component of each institution's network using a modified
version of Newman's leading-eigenvector method [29] in conjunction with subsequent
Kernighan-Lin node-swapping steps [21]. We compare the communities to partitions
obtained by grouping users according to each of the self-identified characteristics:
major, class year, high school, and dormitory/House.
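
The modularity Q that we report for each partition can be evaluated directly for any given assignment of nodes to communities. The sketch below assumes the standard Newman-Girvan definition for an unweighted, undirected graph without self-loops, Q = Σ_c [m_c/m - (d_c/2m)²], where m is the total number of edges, m_c is the number of edges inside community c, and d_c is the sum of the degrees of the nodes in c. It only scores a given partition; it does not implement the leading-eigenvector optimization or the Kernighan-Lin refinement, and the function and variable names are illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a node partition (assumed definition)."""
    m = len(edges)
    intra = defaultdict(int)       # m_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing every node in one community gives Q = 0.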

We first revisit Caltech's community structure, which we previously examined
visually in Fig. 2.2. The partition of the largest connected component into 12
communities (which has modularity Q = 0.4003) exhibits a strong correlation with House
affiliation. To investigate this quantitatively, we calculate the similarity coefficients
of this partition versus each partition constructed using one of the four available user
characteristics (see Table 4.1). The raw S_i values appear to be insufficient to the task
of comparing these communities. Specifically, the ordering of the correlation strengths
with the different demographics is not consistent across pair-counting indices, even
among those we know are linear transformations of one another. Additionally,
although there is agreement that the correlation with House is strongest, the S_i values
differ wildly in how much they set apart the House correlation, with S_R and S_M
seemingly indicating that the correlation with House is only marginally stronger than that
with high school even though Caltech has very few students at any one time who
come from the same high school.
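
For readers who wish to reproduce this kind of comparison, the following minimal Python sketch computes four of the pair-counting indices from two label vectors (for example, community labels and House labels). It assumes the usual pair-counting definitions based on the number of node pairs that are grouped together in both partitions; the symbols `w11`, `w10`, `w01`, and `w00` and the function names are notation introduced here for illustration. The Γ and Minkowski indices are omitted.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(a, b):
    """Return (w11, w10, w01, w00) for two labelings of the same nodes."""
    total = comb(len(a), 2)  # number of node pairs
    w11 = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    w10 = together_a - w11   # together in a only
    w01 = together_b - w11   # together in b only
    w00 = total - w11 - w10 - w01
    return w11, w10, w01, w00


def similarity_indices(a, b):
    """Rand, adjusted Rand, Jaccard, and Fowlkes-Mallows coefficients.

    Assumes each partition groups at least one pair of nodes together and
    that the partitions are not both trivial (otherwise a denominator is zero).
    """
    w11, w10, w01, w00 = pair_counts(a, b)
    total = w11 + w10 + w01 + w00
    m1, m2 = w11 + w10, w11 + w01
    expected = m1 * m2 / total  # expected w11 for fixed group sizes
    return {
        "rand": (w11 + w00) / total,
        "adjusted_rand": (w11 - expected) / (0.5 * (m1 + m2) - expected),
        "jaccard": w11 / (w11 + w10 + w01),
        "fowlkes_mallows": w11 / sqrt(m1 * m2),
    }


a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 2, 2]
print(similarity_indices(a, b))
```

For fixed group sizes, the Rand and adjusted Rand coefficients are both linear functions of `w11`, which is why they can rank comparisons differently in raw value yet share the same z-score.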

These apparent disagreements in interpretation across S_i values occur even though
we know that their corresponding p-values in the (unobtained) random distributions
are identical. While we cannot directly calculate those p-values, the z-scores for each
(see Section 2.3) in Table 4.1 indicate that the correlation with high school is the
only one of the four demographic characteristics that is not statistically significant.
We note that the ordering of the VI scores in Table 4.1 is consistent with that of
the z-scores, but recall that such agreement of ordering is not consistently observed
in Fig. 2.3. The z-scores provide a consistent interpretation of the roles of the four
characteristics in this Caltech data: House is most important, followed distantly by
year and major (in descending order), with no significant correlation with high school.
Because of the close agreement between the z_J, z_M, and z_R scores in Fig. 2.3 and Table
4.1, we henceforth restrict attention to the analytically-obtained z_R values.
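
The analytical formula for z_R is given in Section 2.3 and is not reproduced here. As a stand-in, the following Python sketch estimates a z-score by a permutation test: it shuffles one labeling (which preserves all group sizes), recomputes the number of node pairs grouped together in both partitions, and standardizes the observed count against this null sample. Because the Rand coefficient is a linear function of that count when group sizes are fixed, the resulting value estimates z_R, but it is a Monte Carlo approximation rather than the analytical result, and the names `w11` and `permutation_z` are hypothetical.

```python
import random
from collections import Counter
from math import comb
from statistics import mean, pstdev


def w11(a, b):
    """Number of node pairs that are grouped together in both partitions."""
    return sum(comb(c, 2) for c in Counter(zip(a, b)).values())


def permutation_z(a, b, n_perm=1000, seed=0):
    """Permutation estimate of the z-score of w11 (and hence of the Rand index)."""
    rng = random.Random(seed)
    observed = w11(a, b)
    shuffled = list(b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # relabel nodes at random, keeping group sizes
        null.append(w11(a, shuffled))
    sigma = pstdev(null)
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mean(null)) / sigma


# Toy check: a partition compared with itself is far from the null.
communities = [0] * 20 + [1] * 20 + [2] * 20
print(permutation_z(communities, communities, n_perm=200))
```

The number of permutations limits the precision of the estimate, so very large z-scores such as those reported for class year are better obtained analytically.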

              Connected   Indicated   Indicated     Indicated   Indicated     Indicated
              Users       Major       Dorm/House    Year        High School   All
Caltech       762         687         594           651         633           499
Georgetown    9388        7510        6594          8374        7562          4774
Oklahoma      17420       15779       7203          13732       14998         5510
Princeton     6575        4940        4355          5801        5214          2920
UNC           18158       15492       8989          15883       15414         6719

Table 4.2

Numbers of nodes of each data set used in the different protocols for treating missing data.

                              Caltech   Georgetown   Oklahoma   Princeton   UNC
Inclusion:  "Major"           3.962     5.885        3.799      15.03       8.044
            "Dorm/House"      200.8     148.8        71.00      58.26       113.0
            "Year"            6.727     1543         206.7      1058        778.2
            "High School"     -0.553    26.13        18.50      15.62       15.93
Pairwise:   "Major"           4.051     16.00        16.44      9.968       5.700
            "Dorm/House"      285.3     212.9        186.9      147.2       93.34
            "Year"            5.389     1837         286.1      1270        889.1
            "High School"     0.7695    4.247        22.54      2.888       37.22
Listwise:   "Major"           2.235     15.23        26.10      10.07       13.90
            "Dorm/House"      248.9     221.5        159.9      116.5       90.50
            "Year"            2.644     1913         251.2      997.3       475.7
            "High School"     0.3063    1.228        13.69      2.415       21.12

Table 4.3

Analytically-obtained z_R-scores for comparing the algorithmically-identified communities of
Facebook networks versus user characteristics. Cases where users did not volunteer demographic
characteristics are treated by three protocols: inclusion, pairwise removal, and listwise removal.

Before concluding our discussion of Caltech, we acknowledge the potentially
important effects of missing demographic data, as a significant number of users did not
volunteer an affiliation (as indicated in Table 4.2 and by the purple wedges of Fig. 2.2).
One can approach the issue of missing data using sophisticated tools such as multiple
imputation, likelihood, or weighting methods [16]. A simpler approach is to
investigate the effects on the measured correlations by various restrictions of the data.
We consider three such protocols: inclusion, pairwise removal, and listwise removal.
Inclusion, which we use in Table 4.1, treats the missing labels like any other category,
erroneously grouping all such users together in the demographic partition. We apply
pairwise removal separately for each demographic comparison with the community
structure. In terms of a contingency table of r demographic rows and c community
columns, this amounts to a deletion of the row corresponding to "Missing." Listwise
removal restricts the comparisons to the subset of users who volunteered all four of the
studied demographic characteristics. We stress that these protocols do not affect the
community assignments, which we obtained using the complete network data. Other
restrictions or combinations of this data (such as single-gender restrictions) can also
be fruitfully explored, but such investigations are beyond the scope of the present
article.
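
The three missing-data protocols can be sketched in a few lines of Python. The example assumes that community assignments are stored in a dictionary and that each user's characteristics are stored in a dictionary in which `None` marks a field that the user did not volunteer; these data structures, the function names, and the toy records are hypothetical. Each function returns the pair of label lists that would then be passed to a partition-comparison routine.

```python
MISSING = "Missing"


def inclusion(community, attrs, key):
    """Keep every user; missing values form their own "Missing" category."""
    nodes = sorted(community)
    labels = [attrs[u][key] if attrs[u][key] is not None else MISSING
              for u in nodes]
    return [community[u] for u in nodes], labels


def pairwise_removal(community, attrs, key):
    """Drop users who did not volunteer the characteristic being compared."""
    nodes = [u for u in sorted(community) if attrs[u][key] is not None]
    return [community[u] for u in nodes], [attrs[u][key] for u in nodes]


def listwise_removal(community, attrs, key):
    """Keep only users who volunteered every characteristic."""
    nodes = [u for u in sorted(community)
             if all(value is not None for value in attrs[u].values())]
    return [community[u] for u in nodes], [attrs[u][key] for u in nodes]


# Community labels come from the full network and are never recomputed.
community = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": 1}
attrs = {
    "u1": {"major": "math", "year": 2008},
    "u2": {"major": None, "year": 2008},
    "u3": {"major": "bio", "year": None},
    "u4": {"major": "bio", "year": 2009},
    "u5": {"major": None, "year": None},
}
for protocol in (inclusion, pairwise_removal, listwise_removal):
    print(protocol.__name__, protocol(community, attrs, "major"))
```

In this toy example, inclusion compares all five users, pairwise removal on "major" compares three, and listwise removal compares only the two users with complete records, which mirrors the shrinking node counts in Table 4.2.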

In Table 4.3, we present the z_R-scores for all four community-demographic
comparisons using each of the three missing-data protocols at the five universities we study.
We caution that because of network-size effects (reflecting the different numbers of
nodes in different examples), z-score values cannot typically be directly compared
across institutions. Accordingly, our primary conclusions are about the statistical
significances and rank orderings of the demographic correlations separately in each
university. Our previous conclusions about the Caltech community structure remain
largely consistent across all three missing-data protocols: House is most strongly
correlated with the communities, followed distantly by year and major (in descending
order), with no statistically significant correlation with high school. While House
remains strongly correlated with communities in all three protocols, the correlations with
year and major appear to be only marginally statistically significant in the analysis
with listwise removal.

In contrast with Caltech, the communities at each of the other four institutions
that we study correlate primarily with class year (see Table 4.3). Moreover, these
correlations are not as dominant as House is at Caltech, as each of the four
characteristics possesses statistically significant correlations with the community structures at
the other four institutions (except high school in listwise removal at Georgetown). We
show the 12 communities identified at Princeton colored both by class year and by
major in Fig. 4.1. Compared with the strong correlation between communities and
House affiliation at Caltech, these visual depictions of the Princeton communities do
not seem to indicate as strong a correlation with year despite the very large
corresponding z_R (which again cautions against direct comparison of z_R values in networks
of different sizes). We remark that the size of the Princeton data set, with over 8500
nodes (6575 in the largest connected component), is disproportionately large relative
to the institution's size; this is presumably a result of the relatively early Facebook
adoption there.

The z-scores in Table 4.3 reveal that Princeton students break up into
communities primarily according to class year (among the four demographic categories available
to us), and dormitory gives the second highest correlation. While major is also
significant, the correlation with high school appears to be only marginally significant in
protocols that remove missing data. One can draw similar conclusions about
Georgetown from Table 4.3; the only qualitative difference is the possible lack of significance
of high school at Georgetown (as compared to the marginal significance at Princeton)
that is suggested by the more stringent missing-data protocols.




Fig. 4.1. [Color] Pie charts of Princeton, colored by (Left) class year and (Right) major. (As
before, purple slices correspond to people who did not identify the relevant characteristic.)

Similarly, the z-scores calculated for the UNC network partitioned into 5
communities suggest that class year is the primary organizing characteristic and that
dormitory residence is also prominent. High school and major have smaller but
significant positive correlations with the community structure. The other large state
university that we consider is the University of Oklahoma, which is also partitioned
into 5 communities. Like UNC, the dominant correlation of the Oklahoma
communities is with year, the secondary correlation is with dormitory, and both high school and
major have statistically significant correlations. Unlike at UNC, however, the disparity
between the correlations with year and with dormitory does not appear to be as wide
at Oklahoma. In contrast to Princeton and Georgetown, communities at both UNC
and Oklahoma maintain unquestionably significant correlations with high school in
both removal-based missing-data protocols.

We close this section by cautioning about interpretations of conclusions drawn
from the numbers in Table 4.3, even though they indicate some interesting differences
among the institutions that we studied. In particular, one should of course be
careful about how such numbers might be influenced by our methodologies. Although
we have provided three different protocols for handling missing data, other effects
might be similarly worthy of study. For instance, one should be wary of the possible
influence of the selected definition of "community" and the method of its detection.
There are numerous definitions and methods available (again see Refs. [51133]), and a
more definitive analysis of the connections between communities and characteristics
in such networks should more fully explore multiple notions of community, possibly
hierarchical structures, and communities at different resolutions.

As a simple example of comparing results from different community-detection 
methods, we compare the 12-community Caltech partition with that obtained for a 
7-community partition (with Q = 0.3594), which we obtained using a simpler spectral 
modularity-optimization implementation. Despite the necessarily different details of 
these two community structures, the qualitative conclusions from the two partitions 
are the same: House provides the dominant correlation, followed distantly by year 
and major, and there is again no significant correlation with high school. Applying this
same "weaker" (in the sense of consistently resulting in partitions of lower modular- 
ity) community-detection implementation to the other four institutions also typically 
agrees with the results that we report above: Year has the strongest correlation with 
communities and is followed by dormitory. The role of high school appears to be 
more pronounced in these lower-modularity partitions, as one obtains statistically 
significant correlations with the communities at Georgetown and Princeton and even 
stronger correlations with the communities at UNC and Oklahoma. 

We also stress the difference between causation and correlation. In this paper, we
have examined correlations. As discussed in the sociological literature on SNSs (see [2]
and references therein), it is interesting and important to attempt to
discern which common characteristics have resulted from friendships and which ones
might perhaps influence the formation of friendships. In terms of the individual
characteristics discussed above, high school and class year are known prior to the
formation of these Facebook links, so one would expect those particular correlations
to also indicate how some friendships might have formed. Common residences and
majors, on the other hand, can both encourage new friendships and arise because of
them. We note, finally, that SNS friendships provide only a surrogate for offline ones,
so that one can also expect to find some differences between the community structures
of Facebook networks and the real-life networks that they imperfectly represent [2].

5. Conclusions. We have demonstrated that analysis of community structure
is useful for studying the online social networks of universities and inferring
interesting insights about the prominent driving forces of community development in their
corresponding offline social networks. We investigated various measures for
comparing algorithmically-identified communities in Facebook networks with those obtained
by grouping individuals according to self-identified characteristics. We found that
z-scores of pair-counting indices provide an immediate (though not quantitatively
perfect) interpretation about the likelihood that such values might arise at random,
indicating significant correlations between the algorithmically-identified communities
and multiple self-identified characteristics. Such calculations indicate that the
organizational structure at Caltech, which depends very strongly on House affiliation, is
starkly different from those of the other universities that we studied. The observed
heterogeneity in the communities, even at an institution like Caltech whose social
structure seems to be mostly dominated by a single feature (House affiliation),
underscores the important point that networks typically have multiple organizational
forces. We hope that our work leads to a wider comparative study that might
increase understanding about the different factors that drive the social organization
of universities. The present paper attempts to provide foundational steps for such
comparative investigations by conveying a meaningful methodology.